This notebook aims to provide one-click, run-all code for data loading, processing, EDA, clustering, and figure plotting.

Data Loading and Processing

The project folder should be structured as follows:

├── (Your name for the project folder)
    ├── README.md
    ├── data
    │   ├── Introducing_the_Enron_Corpus.pdf
    │   ├── enron_mail_20150507
    │   │   └── maildir
    │   └── klimt-ecml04.pdf
    ├── figures
    │   ├── email_all_cluster.png
    │   ├── email_all_network.png
    │   ├── email_inbox_cluster.png
    │   └── email_inbox_network.png
    ├── final_project.Rmd
    ├── final_project.html
    ├── notebooks
    │   └── experiments.Rmd
    ├── results
    │   └── community_assignments.csv
    └── scripts
        ├── data_loading.R
        ├── download_enron.R
        ├── starting_code.R
        └── trial_igraph.R

Make sure the folders are structured exactly as shown above so that the notebook runs reproducibly!
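A quick way to verify the layout before running anything is sketched below. This is only an illustration: `check_project_layout` is a hypothetical helper that is not part of the project scripts, and the demo runs on a throwaway directory rather than the real project folder.

```r
# Hypothetical helper (not in the original scripts): return the expected
# top-level sub-folders that are missing under `root`.
check_project_layout <- function(root,
                                 expected = c("data", "figures", "notebooks",
                                              "results", "scripts")) {
  present <- dir.exists(file.path(root, expected))
  expected[!present]
}

# Demo on a temporary directory that contains only data/ and scripts/
demo_root <- file.path(tempdir(), "enron_layout_demo")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "scripts"), showWarnings = FALSE)

missing_dirs <- check_project_layout(demo_root)
print(missing_dirs)
```

On the real project, calling the helper with the project folder should return an empty character vector if the layout is complete.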

# set the paths and working directory
notebook_path <- rstudioapi::getActiveDocumentContext()$path 
mother_path <- dirname(dirname(notebook_path))

# set the path for the project, data, scripts, etc.
Sys.setenv(mother_path = mother_path)
script_path <- paste0(mother_path, "/scripts")
data_path <- paste0(mother_path, "/data")
results_path <- paste0(mother_path, "/results")
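The path detection above relies on `rstudioapi`, so it only works inside RStudio. The following is a minimal sketch of a fallback, not the notebook's own method: `get_project_root` is a hypothetical helper, and it assumes that outside RStudio the working directory is the notebooks folder (which is the default when knitting an Rmd file).

```r
# Hypothetical helper (not in the original scripts): locate the project folder.
get_project_root <- function() {
  in_rstudio <- requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()
  if (in_rstudio) {
    doc_path <- rstudioapi::getActiveDocumentContext()$path
    # The path is empty for unsaved documents, so only use it when it is set
    if (nzchar(doc_path)) {
      return(dirname(dirname(doc_path)))
    }
  }
  # Assumption: the working directory is <project>/notebooks
  dirname(normalizePath(getwd()))
}

project_root <- get_project_root()
print(project_root)
```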

Next, download the data (if needed) and load all the data needed for the analysis. To load the data yourself or to apply different filters, see /scripts/data_loading.R.

# Check whether the data has already been downloaded; if not, download and untar it.
if (!file.exists(paste0(data_path, "/enron_mail_20150507"))) {
  message("No dataset detected; starting download, please wait.")
  source(paste0(script_path, "/download_enron.R"))
  download_data()
}

# load all the data we need for analyses
load(paste0(results_path, "/dfs.Rdata")) 

Before looking into the data, we first clarify two potentially confusing concepts:

Using these two concepts, we briefly explain the content of the dataframes and how they are obtained:

The data loading and processing procedure for all.within.fromto.df is as follows:

  1. List the paths of all mail files and collect them in a dataframe.

  2. Read the From line of each email and keep only the emails sent by users within the company. Use the user’s name as To, and build a dataframe containing all the From and To information.

  3. Create the users dataframe by extracting all the mailnames each user uses and matching them with the user’s name.

  4. Keep only the emails whose From and To users are both within the company (i.e. included in users).
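The four steps above can be sketched on a toy maildir as follows. This is an illustration under stated assumptions, not the code in /scripts/data_loading.R: the users (`alpha-a`, `beta-b`), their addresses, and the file layout are made up, "the user" in step 2 is taken to be the owner of the mailbox folder, and the users table is written by hand instead of being extracted from the mails.

```r
# Build a tiny toy maildir: <root>/<user>/<folder>/<file> (all names made up)
root <- file.path(tempdir(), "toy_maildir")
unlink(root, recursive = TRUE)
write_mail <- function(user, folder, id, from) {
  dir_path <- file.path(root, user, folder)
  dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
  writeLines(c("Message-ID: <toy>", paste("From:", from), "", "body"),
             file.path(dir_path, id))
}
write_mail("alpha-a", "inbox", "1", "b.beta@enron.com")
write_mail("beta-b", "inbox", "1", "a.alpha@enron.com")
write_mail("alpha-a", "inbox", "2", "outsider@example.com")

# Step 1: list all mail file paths and form a dataframe
mails <- data.frame(path = list.files(root, recursive = TRUE),
                    stringsAsFactors = FALSE)

# Step 2: read the From line; the mailbox owner (first path component) is To
read_from <- function(p) {
  lines <- readLines(file.path(root, p), warn = FALSE)
  hit <- grep("^From:", lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub("^From:", "", hit[1]))
}
mails$from_address <- vapply(mails$path, read_from, character(1),
                             USE.NAMES = FALSE)
mails$to <- vapply(strsplit(mails$path, "/"),
                   function(parts) parts[1], character(1))

# Step 3: users table mapping each mailname to a user (hand-written here)
users <- data.frame(user = c("alpha-a", "beta-b"),
                    mailname = c("a.alpha@enron.com", "b.beta@enron.com"),
                    stringsAsFactors = FALSE)

# Step 4: keep emails whose sender and recipient are both known users
mails$from <- users$user[match(mails$from_address, users$mailname)]
within_df <- mails[!is.na(mails$from) & mails$to %in% users$user,
                   c("from", "to")]
print(within_df)
```

The email from the outside address is dropped in step 4 because its sender does not match any mailname in the users table.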

REMARK: The “/sent_items/” folder of pereira-s contains a sub-folder named “clickathome”. After checking the content of its only email (an advertisement), we chose to exclude it from our investigation.

Exploratory Data Analysis

We conduct EDA on both the inbox mail data and the full mail data. For both datasets, we plot exploratory figures to look for clear patterns or interesting trends.

Inbox mails

First, we plot histograms of the number of emails each user sent/received.

We also plot the same counts restricted to users with at least 50 emails, for better visualization.

# load the packages used below (dplyr for the pipeline, ggplot2 for plotting)
library(dplyr)
library(ggplot2)

# keep only senders with at least 50 emails and order them by email count
filtered_sent <- inboxes.within.fromto.df %>%
  group_by(from) %>%
  filter(n() >= 50) %>%
  mutate(num_from = n()) %>%
  ungroup() %>%
  mutate(from = factor(from, levels = names(sort(table(from), decreasing = TRUE))))

hist_sent <- ggplot(filtered_sent, aes(x = from)) +
  geom_bar() +
  labs(title = "Histogram of inbox emails sent within company (>= 50)", x = "Sent by:") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# keep only recipients with at least 50 emails and order them by email count
filtered_receive <- inboxes.within.fromto.df %>%
  group_by(to) %>%
  filter(n() >= 50) %>%
  mutate(num_to = n()) %>%
  ungroup() %>%
  mutate(to = factor(to, levels = names(sort(table(to), decreasing = TRUE))))

hist_received <- ggplot(filtered_receive, aes(x = to)) +
  geom_bar() +
  labs(title = "Histogram of inbox emails received within company (>= 50)", x = "Sent to:") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot(hist_received)

plot(hist_sent)

Interestingly, grigsby-m, who sent the most emails to others’ inboxes, does not even show up in the filtered figure of inbox emails received. From an outside source, we know that grigsby-m held the title VP Trading, ENA Gas West, so it is no surprise that he sent a large number of emails to other users. Now we cross-check which users appear among both the busiest senders and the busiest receivers:

common_users <- intersect(filtered_sent$from, filtered_receive$to)
print(common_users)
## [1] "watson-k"     "tycholiz-b"   "dasovich-j"   "whitt-m"      "nemec-g"     
## [6] "shackleton-s" "heard-m"

Among them, tycholiz-b is VP Trading, ENA Gas West; shackleton-s is VP ENA & Senior Counsel; dasovich-j is Dir State Government Affairs; and heard-m is Specialist Legal.

Also, we check the mail clusters and social network plots:

set.seed(321)
# Get the unique senders and recipients
all_names <- unique(c(inboxes.within.fromto.df$from, inboxes.within.fromto.df$to))

# Create a contingency table with a predefined set of row and column names
mail_count_table <- table(factor(inboxes.within.fromto.df$from, levels = all_names), factor(inboxes.within.fromto.df$to, levels = all_names))

# print(table_inboxes_from_to_within)

mail_count_df <- melt(mail_count_table)

# Create a heatmap using ggplot2 (log scale; pairs with zero emails give log(0) = -Inf)
ggplot(mail_count_df, aes(x = Var1, y = Var2, fill = log(value))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Email Interaction Heatmap", x = "From", y = "To", fill = "log(count)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Create a graph from the matrix
graph <- graph_from_adjacency_matrix(mail_count_table, mode = "undirected", weighted = TRUE, diag = FALSE)
# Filter edges with low weight (e.g., below a threshold)
# graph <- delete_edges(graph, E(graph)[weight < 5])
community <- cluster_louvain(graph)
png(paste0(mother_path, "/figures/email_inbox_cluster.png"), width=3200, height = 3200)
plot(community, graph, vertex.size=4)
dev.off()
## quartz_off_screen 
##                 2
layout <- layout_with_fr(graph)  # Fruchterman-Reingold layout (often better for clarity)
png(paste0(mother_path, "/figures/email_inbox_network.png"), width = 3200, height = 3200)  # Width and height in pixels


# Plot the network graph with adjustments
plot(graph, 
     vertex.size = 5,         # Larger nodes
     vertex.label.cex = 0.8,  # Adjust text size
     edge.width = E(graph)$weight,  # Edge width based on the weight (email count)
     layout = layout,         # Use the new layout for better node spacing
     main = "Email Interaction Network",
     vertex.label.color = "black", # Change label color for contrast
     vertex.color = "lightblue",  # Node color
     edge.arrow.size = 0.5,   # Adjust arrow size on edges
     edge.color = "gray",     # Edge color
     vertex.label.dist = 1,   # Distance between label and node
     vertex.frame.color = "white")  # Frame color around nodes

# Close the PNG device (save the plot)
dev.off()
## quartz_off_screen 
##                 2

Figure: Inbox Cluster (figures/email_inbox_cluster.png)
Figure: Inbox Network (figures/email_inbox_network.png)

In the inbox cluster/network figures, causholli-m appears to form a cluster of her own. Let’s check what happened to causholli-m:

causholli_inbox <- inboxes.within.fromto.df %>%
  filter(to == 'causholli-m')

causholli_inbox
##          from          to
## 1 causholli-m causholli-m

It is just one email that she sent to herself, which is why she forms an isolated group. This also shows that considering only the inbox mails is far from enough!
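Because the graph is built with `diag = FALSE`, a self-sent email contributes no edge, so a user whose only emails are to herself ends up with no neighbours. A minimal base-R sketch for spotting such users directly from a from/to table is shown below; the toy users are made up for illustration.

```r
# Toy edge list (made-up users); "c-c" only ever emails herself
toy_mails <- data.frame(
  from = c("a-a", "b-b", "a-a", "c-c"),
  to   = c("b-b", "a-a", "b-b", "c-c"),
  stringsAsFactors = FALSE
)

# Everyone who appears as a sender or a recipient
all_users <- unique(c(toy_mails$from, toy_mails$to))

# Emails that connect two different users
real_links <- toy_mails[toy_mails$from != toy_mails$to, ]
connected_users <- unique(c(real_links$from, real_links$to))

# Users left without any link once self-sent emails are ignored
isolated_users <- setdiff(all_users, connected_users)
print(isolated_users)
```

The same steps could be applied to inboxes.within.fromto.df to list every user who would show up as an isolated node.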


From now on, we focus on all the emails within the Enron company:

We also make the same histograms to view the top senders/receivers:

The results are more convincing when we consider all the mails among the users. The highest count in both From and To belongs to Kay Mann (mann-k), who was the head of legal for Enron. That she sent so many emails is ironic, given that Enron was breaking every law in the book. In addition, the newly added users are germany-c (Capacity Trader), jones-t (Senior Legal Specialist), scott-s (Assistant Trader), and sager-e (VP & Assistant General Counsel).

common_users <- intersect(filtered_sent$from, filtered_receive$to)
print(common_users)
##  [1] "arnold-j"      "beck-s"        "dasovich-j"    "shackleton-s" 
##  [5] "mann-k"        "jones-t"       "bass-e"        "lenhart-m"    
##  [9] "scott-s"       "fossum-d"      "symes-k"       "germany-c"    
## [13] "nemec-g"       "perlingiere-d" "sager-e"       "rodrique-r"   
## [17] "stclair-c"

We also plot the clusters and network using all the email sending/receiving information.

set.seed(123)
# Get the unique senders and recipients
all_names <- unique(c(all.within.fromto.df$from, all.within.fromto.df$to))

# Create a contingency table with a predefined set of row and column names
mail_count_table <- table(factor(all.within.fromto.df$from, levels = all_names), factor(all.within.fromto.df$to, levels = all_names))

# print(table_inboxes_from_to_within)

mail_count_df <- melt(mail_count_table)

# Create a heatmap using ggplot2 (log scale; pairs with zero emails give log(0) = -Inf)
ggplot(mail_count_df, aes(x = Var1, y = Var2, fill = log(value))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Email Interaction Heatmap", x = "From", y = "To", fill = "log(count)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Create a graph from the matrix
graph <- graph_from_adjacency_matrix(mail_count_table, mode = "undirected", weighted = TRUE, diag = FALSE)
# Filter edges with low weight (e.g., below a threshold)
# graph <- delete_edges(graph, E(graph)[weight < 5])
community <- cluster_louvain(graph)
png(paste0(mother_path, "/figures/email_all_cluster.png"), width=3200, height = 3200)
plot(community, graph, vertex.size=4)
dev.off()
## quartz_off_screen 
##                 2
layout <- layout_with_fr(graph)  # Fruchterman-Reingold layout (often better for clarity)
png(paste0(mother_path, "/figures/email_all_network.png"), width = 3200, height = 3200)  # Width and height in pixels


# Plot the network graph with adjustments
plot(graph, 
     vertex.size = 5,        # Larger nodes
     vertex.label.cex = 0.8,  # Adjust text size
     edge.width = E(graph)$weight,  # Edge width based on the weight (email count)
     layout = layout,         # Use the new layout for better node spacing
     main = "Email Interaction Network",
     vertex.label.color = "black", # Change label color for contrast
     vertex.color = "lightblue",  # Node color
     edge.arrow.size = 0.5,   # Adjust arrow size on edges
     edge.color = "gray",     # Edge color
     vertex.label.dist = 1,   # Distance between label and node
     vertex.frame.color = "white")  # Frame color around nodes

# Close the PNG device (save the plot)
dev.off()
## quartz_off_screen 
##                 2

Figure: All Cluster (figures/email_all_cluster.png)
Figure: All Network (figures/email_all_network.png)

Look further into the clusters

From the email_all_cluster figure, we pick the two best-performing clusters (judged by eye), get their members, and export them into cluster1 and cluster2. cluster1 and cluster2 now hold the names, email file directories, and mailnames of the users within them.
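One possible way to pull the members of a chosen cluster out of a Louvain result is sketched below. It assumes the igraph package and uses a made-up toy graph of two disconnected triangles; the names `toy_graph`, `toy_community`, and `assignments` are illustrative and this is not necessarily how cluster1 and cluster2 were produced.

```r
library(igraph)

# Toy graph (made-up vertices): two disconnected triangles, a-b-c and d-e-f
toy_edges <- data.frame(
  from = c("a", "b", "a", "d", "e", "d"),
  to   = c("b", "c", "c", "e", "f", "f"),
  stringsAsFactors = FALSE
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

set.seed(123)
toy_community <- cluster_louvain(toy_graph)

# One row per user with the id of the community it was assigned to
assignments <- data.frame(
  user = V(toy_graph)$name,
  community = as.vector(membership(toy_community)),
  stringsAsFactors = FALSE
)

# Members of the community that contains user "a"
target <- assignments$community[assignments$user == "a"]
cluster1_members <- assignments$user[assignments$community == target]
print(cluster1_members)
```

With a table like `assignments`, the chosen members can then be matched against the users dataframe to recover their email directories and mailnames.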